8 research outputs found

    MaxPart: An Efficient Search-Space Pruning Approach to Vertical Partitioning

    Vertical partitioning is the process of subdividing the attributes of a relation into groups, creating fragments. It is an effective way of improving performance in database systems where a significant percentage of query processing time is spent on full table scans. Most of the proposed approaches to vertical partitioning use a pairwise affinity to cluster the attributes of a given relation. The affinity measures how frequently a pair of attributes is accessed together. Attributes with high affinity are clustered together so as to create fragments containing as many strongly connected attributes as possible. However, such fragments can be obtained directly and efficiently using maximal frequent itemsets. This knowledge-engineering technique better reflects closeness, or affinity, when more than two attributes are involved, and the partitioning process can be done faster and more accurately with the help of such a knowledge-discovery technique from data mining. In this paper, an approach to vertical partitioning based on maximal frequent itemsets is proposed to efficiently search for an optimized solution by judiciously pruning the potential search space. Moreover, we propose an analytical cost model to evaluate the produced partitions. Experimental studies show that the cost of the partitioning process can be substantially reduced using only a limited set of potential fragments. They also demonstrate the effectiveness of our approach in partitioning both small and large tables.
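    The abstract describes the idea at a high level only. The following sketch, with an invented toy workload and threshold (workload, min_support), illustrates how maximal frequent itemsets mined from the attribute sets accessed by queries can serve as candidate vertical fragments; it is a brute-force illustration of the underlying idea, not the MaxPart algorithm itself.

```python
from itertools import combinations

# Hypothetical workload: each query is represented by the set of attributes it reads.
workload = [
    {"id", "name", "salary"},
    {"id", "name"},
    {"salary", "bonus"},
    {"id", "name", "dept"},
    {"salary", "bonus"},
]

def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of frequent attribute sets (fine for small schemas)."""
    attributes = sorted(set().union(*transactions))
    frequent = []
    for size in range(1, len(attributes) + 1):
        for candidate in combinations(attributes, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                frequent.append(frozenset(candidate))
    return frequent

def maximal_itemsets(itemsets):
    """Keep only the itemsets not contained in any other frequent itemset."""
    return [s for s in itemsets if not any(s < other for other in itemsets)]

# Each maximal frequent attribute set is a candidate vertical fragment.
candidates = maximal_itemsets(frequent_itemsets(workload, min_support=2))
print(candidates)   # e.g. [frozenset({'bonus', 'salary'}), frozenset({'id', 'name'})]
```

    In this toy workload, the maximal sets {id, name} and {salary, bonus} emerge as candidate fragments, capturing the affinity-based intuition without computing pairwise affinities.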

    Models to evaluate schemes for an early detection of breast cancer

    Prevention of cancer in general, and of breast cancer in particular, is not possible at present. In the absence of prevention, early detection through screening is sometimes an attractive course of action. Evidence that screening for early detection of breast cancer may be worthwhile has been accumulating since an early study in New York (the HIP study, 1963). The decision to screen a community has a number of serious implications. Having established that early detection is beneficial, mathematical and simulation models are needed to assess and evaluate different screening programs in terms of the stage of disease at detection and the cost of the tests every woman receives. Models are developed to assist public policymakers in assessing screening schemes for the early detection of breast cancer. The impact of changes in the combination of up to three screening procedures used, the frequency with which each procedure is used, and the period of time during which screening is offered can be investigated. Policies are evaluated in terms of the progress of the disease at detection, the expected cost of the tests, the probability of detection by screening, and the number of false positives produced. The model is implemented on an IBM-PC computer. The program offers an interactive and user-friendly environment so that different schemes can be tried easily.
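    The abstract lists the evaluation criteria without detailing the model. The sketch below is a hypothetical Monte Carlo illustration, with invented parameters (SENSITIVITY, SPECIFICITY, COST_PER_TEST, ANNUAL_ONSET_PROB) and a fixed list of screening ages, of how a screening schedule can be scored on detection probability, expected test cost, and expected false positives; it is not the paper's model.

```python
import random

# Illustrative, invented parameters: per-screen sensitivity, specificity and cost,
# plus a crude constant annual probability of disease onset. Not the paper's model.
SENSITIVITY = 0.85
SPECIFICITY = 0.95
COST_PER_TEST = 50.0
ANNUAL_ONSET_PROB = 0.002

def simulate_policy(screen_ages, trials=100_000, seed=0):
    """Monte Carlo estimate of detection probability, expected test cost and
    expected number of false positives for a fixed schedule of screening ages."""
    rng = random.Random(seed)
    detected = false_positives = 0
    total_cost = 0.0
    for _ in range(trials):
        # sample an onset age between 40 and the last screen, or no onset at all
        onset_age = None
        for age in range(40, max(screen_ages) + 1):
            if rng.random() < ANNUAL_ONSET_PROB:
                onset_age = age
                break
        for screen_age in screen_ages:
            total_cost += COST_PER_TEST
            has_disease = onset_age is not None and screen_age >= onset_age
            if has_disease:
                if rng.random() < SENSITIVITY:
                    detected += 1
                    break          # stop screening this woman once detected
            elif rng.random() > SPECIFICITY:
                false_positives += 1
    return {"P(detected by screening)": detected / trials,
            "expected test cost": total_cost / trials,
            "false positives per woman": false_positives / trials}

print(simulate_policy(screen_ages=[50, 52, 54, 56, 58, 60]))
```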

    OLAP Textual Aggregation Approach using the Google Similarity Distance

    Data warehousing and On-Line Analytical Processing (OLAP) are essential elements of decision support. In the case of textual data, decision support requires new tools, mainly textual aggregation functions, for better and faster high-level analysis and decision making. Such tools provide textual measures to users who wish to analyse documents online. In this paper, we propose a new aggregation function for textual data in an OLAP context based on the K-means method. This approach highlights aggregates that are semantically richer than those provided by classical OLAP operators. The distance used in K-means is replaced by the Google similarity distance, which takes into account the semantic similarity of keywords for their aggregation. The performance of our approach is analyzed and compared to other methods such as Topkeywords, TOPIC, TuBE and BienCube. The experimental study shows that our approach achieves better performance in terms of recall, precision, F-measure, complexity and runtime.
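    The abstract names the Google similarity distance without spelling it out. The sketch below uses the standard Normalized Google Distance, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), with invented hit counts, and runs one K-means-style assignment step over keywords. Because NGD is not a vector-space distance, the sketch assigns keywords to keyword medoids rather than numeric centroids; that choice is an assumption about how the replacement could be realized, not the paper's exact procedure.

```python
import math

# Hypothetical hit counts, e.g. taken from a search engine or a corpus index.
N = 1_000_000            # assumed total number of indexed documents
hits = {"database": 5000, "warehouse": 1200, "olap": 800, "poem": 300}
joint_hits = {frozenset(("database", "warehouse")): 900,
              frozenset(("database", "olap")): 500,
              frozenset(("warehouse", "olap")): 400,
              frozenset(("database", "poem")): 5,
              frozenset(("warehouse", "poem")): 2,
              frozenset(("olap", "poem")): 1}

def ngd(x, y):
    """Normalized Google Distance between two keywords (0 = identical usage)."""
    fx, fy = math.log(hits[x]), math.log(hits[y])
    fxy = math.log(joint_hits[frozenset((x, y))])
    return (max(fx, fy) - fxy) / (math.log(N) - min(fx, fy))

def assign_to_medoids(keywords, medoids):
    """One K-means-style assignment step, using keyword medoids instead of
    numeric centroids because NGD does not live in a vector space."""
    clusters = {m: [] for m in medoids}
    for k in keywords:
        best = min(medoids, key=lambda m: 0.0 if m == k else ngd(k, m))
        clusters[best].append(k)
    return clusters

print(assign_to_medoids(["database", "warehouse", "olap", "poem"],
                        medoids=["database", "poem"]))
```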

    A Constraint-based Mining Approach for Multi-attribute Index Selection

    The index selection problem (ISP) concerns the selection of an appropriate set of indexes to minimize the total cost of a given workload under a storage constraint. Since the ISP has been proven to be NP-hard, most studies focus on heuristic algorithms to obtain approximate solutions. The problem becomes more difficult for indexes defined on multiple tables, such as bitmap join indexes, since it requires the exploration of a large search space. Studies dealing with the problem of selecting bitmap join indexes have mainly focused on pruning the search space by means of data mining techniques or heuristic strategies. The main shortcoming of these approaches is that the index selection process is performed in two steps: the generation of a large number of indexes is followed by a pruning phase. An alternative is to constrain the input data earlier in the selection process, thereby reducing the output size so as to directly discover indexes that are of interest to the administrator. For example, to select a set of indexes, the administrator may put limits on the number of attributes or the cardinality of the attributes to be included in the index configuration they are seeking. In this paper we address the bitmap join index selection problem using a constraint-based approach. Unlike previous approaches, the selection is performed in one step by introducing constraints into the selection process. The proposed approach is evaluated using the APB-1 benchmark.
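    To make the idea of pushing constraints into the selection step concrete, the sketch below enumerates candidate attribute sets for bitmap join indexes while applying administrator constraints (maximum number of attributes, maximum attribute cardinality, minimum workload frequency) during generation rather than pruning a large result afterwards. All names and thresholds are invented, and the naive enumeration stands in for the paper's constraint-based mining step.

```python
from itertools import combinations

# Assumed inputs: per-attribute workload frequency and cardinality,
# plus administrator constraints pushed into the enumeration itself.
attribute_freq = {"customer.gender": 40, "customer.city": 25,
                  "product.family": 30, "time.year": 35, "product.code": 12}
attribute_card = {"customer.gender": 2, "customer.city": 500,
                  "product.family": 20, "time.year": 10, "product.code": 100_000}

MIN_FREQUENCY = 20          # drop attributes rarely used by the workload
MAX_ATTRIBUTES = 2          # administrator limit on index width
MAX_CARDINALITY = 1_000     # bitmap indexes suit low-cardinality attributes

def candidate_index_configurations():
    """Enumerate candidate bitmap-join-index attribute sets, applying the
    constraints while generating candidates instead of pruning afterwards."""
    usable = [a for a, card in attribute_card.items()
              if card <= MAX_CARDINALITY and attribute_freq[a] >= MIN_FREQUENCY]
    for size in range(1, MAX_ATTRIBUTES + 1):
        for combo in combinations(sorted(usable), size):
            yield combo

for config in candidate_index_configurations():
    print(config)   # each tuple is one candidate index configuration
```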

    UML4NoSQL: A Novel Approach for Modeling NoSQL Document-Oriented Databases Based on UML

    The adoption of Big Data systems by companies is relatively new, although data modeling and system design are ages old. Even though traditional databases are built on solid foundations, they cannot handle the swift and massive flow of data coming from multiple different sources. Here, NoSQL databases are an inevitable alternative. However, these systems are schemaless compared to traditional databases. It is important to emphasize that schemaless does not mean no-schema, nor does it mean that NoSQL databases do not need modeling. Hence, there is a need for conceptual models to define the data structure in these databases. This paper sheds light on the importance of UML in showing how to store Big Data, described through meta-models, within NoSQL databases. We propose a novel Big Data modeling methodology for NoSQL databases called UML4NoSQL, which is independent of the target system and takes into account the four Big Data characteristics: Variety, Volume, Velocity, and Veracity (the 4 V's). The approach relies on UML building blocks with a data-up technique: it starts with a use case and the class diagram resulting from an understanding of the data at hand and the definition of the developer's strategies, while focusing on the user's needs. To illustrate our approach, we take a case study from the health care domain. We show that our approach produces designs that can be implemented on a NoSQL document-oriented system with respect to the Big Data 4 V's.
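    As a minimal, hypothetical illustration of the general idea of deriving a document-oriented structure from a class diagram, the sketch below denormalizes a small health-care fragment (Patient 1..* Visit, Visit many-to-one Physician) into a single document for a store such as MongoDB. The entities, fields and identifiers are invented; this is not the UML4NoSQL transformation rules themselves.

```python
# One patient document combining the Patient class, its composed Visit objects,
# and a reference to the associated Physician.
patient_document = {
    "_id": "patient-0042",
    "name": "Jane Doe",
    "birthDate": "1980-05-14",
    # one-to-many composition embedded as an array of sub-documents
    "visits": [
        {
            "date": "2024-01-10",
            "diagnosis": "hypertension",
            # many-to-one association kept as a reference rather than embedded,
            # so physician data is not duplicated across patients
            "physicianId": "physician-007",
        }
    ],
}

# With a driver such as pymongo the document could be stored as-is:
#   from pymongo import MongoClient
#   MongoClient()["hospital"]["patients"].insert_one(patient_document)
```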

    Textual aggregation approaches in OLAP context: A survey

    In the last decade, OnLine Analytical Processing (OLAP) has taken an increasingly important role as a research field. Solutions, techniques and tools have been provided for both databases and data warehouses, focusing mainly on numerical data; however, these solutions are not suitable for textual data. Therefore, there has recently been a strong need for new tools and approaches that treat and manipulate textual data and aggregate it as well. Textual aggregation techniques emerge as a key tool for performing textual data analysis in OLAP for decision support systems. This paper aims at providing a structured and comprehensive overview of the literature in the field of OLAP textual aggregation. We provide a new classification framework in which the existing textual aggregation approaches are grouped into two main classes, namely approaches based on cube structure and approaches based on text mining. We also discuss and synthesize the potential of textual similarity metrics and provide a recent classification of them.

    Efficiently mining frequent itemsets applied for textual aggregation

    Text mining approaches are commonly used to discover relevant information and relationships in huge amounts of text data. The term data mining refers to methods for analyzing data with the objective of finding patterns that aggregate the main properties of the data. The merger between data mining approaches and On-Line Analytical Processing (OLAP) tools allows us to refine the techniques used in textual aggregation. In this paper, we propose a novel aggregation function for textual data based on the discovery of frequent closed patterns in a generated documents/keywords matrix. Our contribution aims at using a data mining technique, namely a closed pattern mining algorithm, to aggregate keywords. An experimental study on a real corpus of more than 700 scientific papers collected from Microsoft Academic Search shows that the proposed algorithm largely outperforms four state-of-the-art textual aggregation methods in terms of recall, precision, F-measure and runtime.
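    As a concrete illustration of mining closed patterns from a documents/keywords matrix, the sketch below builds a toy matrix (each row is the keyword set of one document), enumerates frequent keyword sets by brute force, and keeps those that are closed, i.e. that have no proper superset with the same support. The documents, keywords and support threshold are invented, and the naive enumeration stands in for the paper's closed pattern mining algorithm.

```python
from itertools import combinations

# Toy documents/keywords matrix: each row is the keyword set of one document.
doc_keywords = [
    {"olap", "aggregation", "text"},
    {"olap", "aggregation", "cube"},
    {"olap", "text", "mining"},
    {"aggregation", "text", "mining"},
]
MIN_SUPPORT = 2

def frequent_itemsets(transactions, min_support):
    """Naive enumeration of frequent keyword sets with their supports."""
    keywords = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(keywords) + 1):
        for cand in combinations(keywords, size):
            support = sum(1 for t in transactions if set(cand) <= t)
            if support >= min_support:
                result[frozenset(cand)] = support
    return result

def closed_itemsets(freq):
    """A frequent itemset is closed if no proper superset has the same support."""
    return {s: sup for s, sup in freq.items()
            if not any(s < t and freq[t] == sup for t in freq)}

freq = frequent_itemsets(doc_keywords, MIN_SUPPORT)
for itemset, support in sorted(closed_itemsets(freq).items(), key=lambda x: -x[1]):
    print(sorted(itemset), support)   # candidate aggregated keyword groups
```

    In this toy matrix, closed sets such as {olap, aggregation} and {text, mining} survive as the candidate aggregated keyword groups, while non-closed sets like {mining} are absorbed by their closures.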